Abstract


Simulating  a  computer  run  can  be  an  excellent  method  for  identifying 
performance  bottlenecks  and  is  especially  valuable  when  discussing  systems  that 
do  not  yet  exist.  Traditional  simulations  collect  a  program  trace  and  then  have  a 
simulator  execute  some  subset  of  the  trace  one  instruction  at  a  time. 
Unfortunately,  all  of  the  standard  variants  of  this  technique  are  far  too  slow  to 
use  on  jobs  for  high-end  High  Performance  Computers  and  Supercomputers.  We 
have  developed  an  approach  based  primarily  on  an  analysis  of  the  memory 
access  patterns  and  the  number  of  floating  point  operations  being  executed  that 
will  estimate  the  performance  of  any  run  in  a  small  fixed  amount  of  time  (e.g.,  a 
few  seconds  or  less).  Experience  has  shown  that  the  results  are  nearly  always 
within  a  factor  of  2  of  the  measured  results  and  frequently  are  within  15%  or 
better  of  the  measured  results. 
1.  Introduction


Simulating a computer run can be an excellent method for identifying
performance  bottlenecks.  This  can  be  especially  valuable  when  discussing 
systems  that  do  not  yet  exist.  As  such,  it  can  augment  benchmarking  efforts  by 
helping  to  explain,  or  even  predict,  results  as  opposed  to  simply  reporting  them. 
Unfortunately,  traditional  techniques  in  this  field  have  suffered  from  three 
constraints: 

(1)  Describing a modern processor architecture with sufficient fidelity to
ascertain  the  validity  of  the  results  is  difficult. 

(2)  The simulations are so expensive to run that many jobs are simply
considered to be prohibitively expensive to simulate. This can be an
especially  serious  problem  in  the  areas  of  high-end  High  Performance 
Computers  and  Supercomputers. 

(3)  Even when it is practical to run a simulation, it is frequently
impractical to run it multiple times to quantify the benefits of various
approaches  to  code  tuning. 

An example of this can be found in a recent posting to the comp.arch newsgroup
by Christopher Brian Colohan, a graduate student at Carnegie
Mellon University:

I  am  currently  working  on  a  project  which  involves  detailed 
processor  simulation  of  the  SPECInt2000  benchmarks.  We  are 
encountering  the  usual  problem  with  simulations:  they  are  taking 
too  long,  even  when  we  use  the  provided  "test"  input.  (The  test 
input for bzip2 takes 41 seconds to run on our SGI Origin machine
-- simulating more than a second or so of real CPU time on our
simulator gets kind of painful...).

John Mashey of SGI responded to this posting:

I  think  people  have  tried  to  achieve  this  effect  by  taking  slices  of 
SPEC  and  other  benchmarks  in  order  to  get,  for  example, 
reference  streams  to  analyze  alternate  designs  for  memory 
hierarchies  (i.e.,  this  is  different  from  changing  the  input).  For 
example,  for  some  codes,  the  performance  on  a  few  iterations 
would model the entire computation but, of course, this doesn't
work  for  others  [1]. 


Note:  Definitions  for  boldface  text  can  be  found  in  the  Glossary. 
In a similar vein, Naraig Manjikian [2] has stated that when using the
SimpleScalar tool set on a 333-MHz Sun Ultra/10 workstation, the simulation
rate was approximately 300,000 instructions/second (compared to a peak rate of
over 1 billion instructions/second when executing code directly on the
hardware).

For the BT, CG, LU, and SP NAS benchmarks (class B), the floating point
operation counts range from 54 to 686 billion for a
single run. On a 300-MHz R12000-based Origin 2000 using a single processor,
this  translates  into  measured  run  times  of  1,250-9,700  seconds.  Clearly,  if 
simulating  a  41-second  run  is  a  problem,  these  industry  standard  benchmarks  for 
high  performance  computing  must  be  all  but  out  of  reach. 
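
To make the scale of the problem concrete, the following minimal sketch extrapolates simulation time from the figures quoted above. It assumes, purely for illustration, that the slowdown implied by the SimpleScalar numbers (300,000 simulated instructions/second against a peak of 1 billion native instructions/second) applies uniformly to the other runs. That ratio comes from a different machine and uses a peak rather than a sustained rate, so the results are rough order-of-magnitude guesses, not measurements. The function name `simulated_seconds` is hypothetical.

```python
# Rough extrapolation of simulation time from the slowdown quoted in the text.
SIM_RATE = 300_000            # simulated instructions/second (SimpleScalar figure)
NATIVE_RATE = 1_000_000_000   # peak native instructions/second (lower bound from the text)


def simulated_seconds(native_seconds, sim_rate=SIM_RATE, native_rate=NATIVE_RATE):
    """Estimate wall-clock simulation time for a run of the given native length.

    Assumption: the slowdown factor native_rate / sim_rate applies uniformly.
    """
    slowdown = native_rate / sim_rate
    return native_seconds * slowdown


if __name__ == "__main__":
    runs = [("bzip2 test input", 41),
            ("NAS class B, shortest", 1250),
            ("NAS class B, longest", 9700)]
    for label, secs in runs:
        days = simulated_seconds(secs) / 86_400
        print(f"{label}: {secs} s native -> roughly {days:.1f} days simulated")
```

Under these assumptions, the 41-second run already takes more than a day to simulate, and the longest class B run takes on the order of a year.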

In  response  to  this  problem,  we  have  developed  an  entirely  different  approach  to 
simulating these runs. This approach is based on the time-tested concept of
"Back-of-the-Envelope"  calculations.  With  this  in  mind,  we  have  named  our 
program  "ENVELOPE."  Rather  than  trying  to  simulate  every  aspect  of  the 
microarchitecture,  this  approach  assumes  that  the  computer  architects  know  how 
to  design  a  processor.  In  particular,  if  they  claim  that  the  peak  floating  point 
speed  is  1  GFLOPS,  then  we  take  them  at  their  word  and  use  that  number  in  the 
simulation.  This  also  applies  to  various  numbers  involving  memory,  cache,  and 
TLB  latency  and  bandwidth.  When  combined  with  other  system  parameters  and 
information  about  the  code  to  be  run  on  the  system,  an  estimated  performance  is 
produced  in  a  small  fixed  amount  of  time  (e.g.,  1  second).  This  run  time  is  short 
enough  to  allow  one  to  easily  investigate  the  effects  of  turning  various  hardware 
features  (e.g.,  prefetching)  on  or  off  and/or  investigate  various  ways  in  which 
the  code  might  be  tuned. 


2.  Description  of  the  Simulator 


The  current  version  of  ENVELOPE  prompts  the  user  for  input  (this  can  come 
from  a  redirected  file),  writes  its  output  to  standard  output,  and  also  writes  an 
annotated copy of the input to the file ENVELOPE.INPUTS. This latter file can
easily  be  edited  and  used  as  input  in  a  future  run  (the  annotations  will  be 
ignored  by  ENVELOPE).  The  output  is  broken  into  two  categories — prompts  for 
input  and  results.  Every  line  that  is  considered  to  be  a  result  begins  with  a  pound 
sign  (#)  so  that  it  can  easily  be  searched  when  using  GREP.  Under  the  direction 
of  Shirley  Moore  of  the  University  of  Tennessee  at  Knoxville,  work  is  under  way 
to  produce  a  friendlier  user  interface  using  Java. 
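
Because every result line begins with a pound sign, separating results from prompts is a one-line filter (the equivalent of `grep '^#'`). The sketch below shows the same idea in Python. The sample text is a hypothetical placeholder, since the actual wording of ENVELOPE's prompts and results is not given here.

```python
def result_lines(output_text):
    """Return the lines of captured output that begin with '#'.

    According to the text, these are the lines ENVELOPE treats as results;
    everything else is a prompt for input.
    """
    return [line for line in output_text.splitlines() if line.startswith("#")]


# Hypothetical placeholder output, not real ENVELOPE text.
sample = (
    "prompt text A\n"
    "# result text A\n"
    "prompt text B\n"
    "# result text B\n"
)

if __name__ == "__main__":
    for line in result_lines(sample):
        print(line)
```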

The first 18 lines of input describe the hardware in terms of peak characteristics
(e.g.,  peak  bandwidth  between  the  outermost  level  of  cache  and  the  processor,  in 
million  bytes  per  second),  minimum  characteristics  (e.g.,  memory  latency  in 
nanoseconds), fundamental values (e.g., cache line size in bytes), and
miscellaneous descriptions regarding the friendliness of the processor's design
(e.g.,  describing  the  pipeline  depth  as  short,  moderate,  deep,  or  very  deep).  This 
last  set  of  values  is  used  to  estimate  how  much  the  peak  speed  of  the  processor 
should  be  discounted.  The  goal  here  is  to  somewhat  level  the  playing  field 
between  designs  that  are  blazingly  fast  but  terribly  unfriendly  to  program  and 
those  that  made  tradeoffs  between  peak  speed  and  usability  by  keeping  the 
number of operations in flight low and/or supporting features such as out-of-order
execution and/or register renaming. In most cases, this heuristic makes
little  difference  in  the  final  result,  since  the  performance  of  the  memory  system  is 
frequently  the  limiting  factor. 

The  remaining  questions  ask  the  user  to  describe  the  software  and  its  memory 
access  patterns.  In  theory,  this  information  can  be  derived  through  inspection  of 
the  user's  program.  In  practice,  only  the  simplest  of  programs  can  be  analyzed  in 
this  manner  with  sufficient  precision.  As  an  aid  to  this  process,  a  tool  has  been 
developed,  which  will  be  described  in  more  detail  in  sections  4  and  5.  The  main 
characteristic  here  is  that  the  variable  usage  can  be  broken  down  into  the 
following  categories: 

(1)  Variables  that  spend  a  significant  amount  of  time  mapped  to  a  register  and 
therefore  have  a  negligible  number  of  loads  or  stores  associated  with  them. 

(2)  Scratch  arrays  that  have  potentially  been  sized  to  fit  in  one  of  the  levels  of 
cache.  An  estimate  of  the  size  of  the  working  set  is  supplied  by  the  user. 
This  can  also  be  used  to  estimate  the  parameters  for  any  working  set  that 
might  exist.  The  assumption  here  is  that  there  is  a  negligible  cache  miss  rate 
associated  with  this  working  set,  but  that  the  flow  of  data  between  the 
cache  and  the  processor  still  needs  to  be  modeled. 

(3)  Blocked  arrays  or  alternatively  a  second,  presumably  larger,  working  set.  If 
this  is  used  to  model  a  blocked  access  pattern  and  if  the  working  set  is  too 
large  to  fit  into  cache,  a  STRIDE-N  access  pattern  is  assumed.  However,  if 
the  option  of  treating  this  as  just  another  working  set  is  used  and  if  it  fails 
to  fit  into  cache,  a  STRIDE-1  access  pattern  is  assumed.  This  flexibility 
allows  us  to  model  the  behavior  of  a  wide  range  of  programs  that  benefit 
from  the  presence  of  a  large  cache  and  is  particularly  useful  for  programs 
with  two  or  more  distinct  sizes  of  working  sets. 

(4)  Arrays  accessed  with  a  STRIDE-N  access  pattern.  In  most  cases,  this  access 
pattern  will  result  in  a  high  TLB  miss  rate.  Almost  all  programs  that  we 
have looked at have at least some data that is accessed in this way,
although for well-tuned codes, less than 0.1% of all loads/stores will fall
into  this  category.  The  LU  NAS  benchmark  is  an  interesting  exception  to 
this  rule.  It  can  have  0.45-0.57%  of  all  loads/stores  falling  into  this  category 
(depending  on  the  version  of  the  code  being  used).  Fortunately,  on  many 
systems, the resulting TLB misses seem to hit in cache, resulting in an
acceptable  level  of  performance.  Unfortunately,  it  can  be  difficult  to  predict 
such  behavior  without  a  priori  knowledge  and/or  experimental  results  to 
compare  to.  However,  when  such  results  are  available,  the  hardware 
parameter  for  the  TLB  latency  can  be  adjusted  to  a  more  appropriate  value 
(e.g.,  subtract  off  the  cost  of  a  cache  miss  from  the  normally  used  cost  of  a 
TLB  miss). 

(5)  Arrays  accessed  with  a  STRIDE-1  access  pattern. 

For  categories  3-5,  the  amount  of  data  reuse  (cache  and  register  levels  combined) 
can  be  specified.  This  allows  accurate  modeling  of  programs  that  might  not  have 
a  working  set,  or  alternatively,  the  working  set  might  be  orders  of  magnitude 
larger  than  the  cache.  Even  so,  the  program  need  not  be  restricted  to  using  each 
data  item  in  a  cache  line  just  once  per  cache  miss.  However,  for  some  usage 
patterns,  a  usage  factor  in  the  range  of  1  to  2  is  exactly  what  will  be  seen.  The 
ability  to  specify  the  amount  of  data  reuse  supports  the  widest  possible  range  of 
programs  without  requiring  hard  coding  in  any  assumptions. 

ENVELOPE  has  been  extensively  tested  with  several  numerically  intensive 
programs  using  a  variety  of  RISC  and  CISC  processors.  It  is  also  designed  to 
handle  codes  with  few,  if  any,  floating  point  operations.  Furthermore,  it  should 
be  able  to  model  other  types  of  architectures  (e.g.,  vector  processors),  although 
no  attempt  has  been  made  to  date  to  exercise  either  of  these  capabilities. 

Table  1  shows  an  example  of  an  input  file  describing  the  Linpack  100  x  100 
benchmark  running  on  an  IBM  SP  with  375-MHz  Power  3  Thin  nodes.  The 
parameters  in  this  file  were  derived  using  a  detailed  analysis  of  the  source  code, 
as  well  as  taking  into  account  common  compiler  optimizations. 

Table  2  shows  an  example  of  an  input  file  describing  the  NAS  CG  class  B 
benchmark  (MPI).  Since  this  benchmark  uses  an  unstructured  grid,  which 
inhibits  prefetching,  prefetching  has  been  disabled.  Again,  the  system  being 
modeled  is  an  IBM  SP  with  375-MHz  Power  3  Thin  nodes. 

The  predicted  level  of  performance  for  the  Linpack  100  on  the  IBM  SP  (as 
previously  described)  is  359  MFLOPS,  while  the  measured  level  of  performance 
is  426  MFLOPS.  The  predicted  performance  is  within  19%  of  the  measured 
performance.  If  one  assumes  that  the  benchmark  was  run  on  a  dedicated  node, 
then  the  processor  could  have  used  a  larger  percentage  of  the  memory 
bandwidth.  This  allows  the  predicted  performance  to  increase  to  between  481 
and 559 MFLOPS (depending on the precise limitations of the processor's
memory  interface),  or  within  11-31%  of  the  measured  performance  [3]. 

For  the  CG  benchmark,  the  predicted  level  of  performance  is  53  MFLOPS.  The 
measured  level  of  performance  is  46  MFLOPS,  or  within  15%  of  the  predicted 
level  of  performance  [4].  Table  3  contains  additional  results. 
3.  The  Equations


This  section  will  discuss  some  of  the  equations  used  by  ENVELOPE  in  estimating 
the  performance  of  a  program.  The  complete  set  of  equations  is  quite  lengthy  and 
is  beyond  the  scope  of  this  paper.  However,  this  discussion  should  be  adequate 
in  giving  one  a  feel  for  how  this  program  works. 


3.1  Commonly Used Scalar Values


The  time  spent  loading  and  storing  commonly  used  values  into  the  registers  is 
computed  using  the  following  equations.  It  is  important  to  note  that  it  is 
assumed  that  these  values  can  be  found  in  one  or  more  levels  of  cache  and  these 
operations  will  not  result  in  any  TLB  misses.  There  is  also  an  implied  assumption 
that  a  processor  cannot  do  more  than  one  memory  operation  per  opportunity  to 
launch a multiply-add instruction. A few processors can, in fact, do better than
that  under  certain  circumstances.  For  most  RISC  and  CISC  processors,  this  is  not 
a  concern.  However,  it  might  be  an  important  consideration  if  this  code  were  to 
be  used  to  model  the  performance  of  a  Cray  C90. 


$$\text{Runtime} = \text{Runtime} + \frac{\text{P\#SV@} * \text{\#S}}{\text{UR\#SV@} * \text{UL2BANDWIDTH}} \tag{1}$$


where # can be either L for LOAD or S for STORE and @ can be either
DEDICATED* or GENERAL.†

P#SV@ refers to the percentage of the total amount of data that is either being
loaded or stored (depending on what # is) of the specified class of values (as
specified by @). #S is the total amount of data that is either loaded or stored.
Therefore, the numerator refers to the amount of data being loaded or stored for
this class of data. UR#SV@ refers to the amount of reuse at the register level for
the data in question. The higher the level of reuse, the fewer the loads and stores
that will actually be executed. In other words, for programs like Linpack, the
code might indicate that a value will be loaded for each and every multiply-add
instruction. However, a smart compiler might actually perform the load just
once. In that case, the value for UR#SV@ would be large enough to make this
part of the calculation irrelevant. Finally, UL2BANDWIDTH is the bandwidth
between the processor and the outermost level of cache (frequently, the L2
cache).


*  The term DEDICATED implies that the values will only be used in calculations involving
other commonly used values. A prime example of this is Horner's algorithm for evaluating
polynomial equations.

†  The term GENERAL implies that while the value will be "locked" into a register, it will be
used in conjunction with other classes of variables. Therefore, the cost of the floating point or
integer calculations that this value is involved with will be charged to the other classes of variables.
An example of this type of variable might be PI.

For  the  DEDICATED  variables,  it  is  also  necessary  to  calculate  the  time  spent 
performing  calculations  involving  these  variables.  This  calculation  is  fairly 
straightforward.  The  only  complicating  factor  is  that  since  some,  if  not  all,  of 
these  calculations  can  be  paired  up  with  load  and  store  instructions  on  today's 
superscalar  processors,  one  must  avoid  counting  these  cycles  twice.  This  can  be 
done  by  subtracting  the  time  spent  on  the  loads  and  stores  for  DEDICATED 
variables  from  the  time  spent  computing  with  them.  Since  it  is  possible  that  the 
loads  and  stores  will  take  longer,  one  needs  to  take  the  maximum  of  the 
difference  and  zero  to  avoid  overcompensation. 
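
As a minimal sketch of the two steps just described, the functions below evaluate the load/store term of equation (1) and then the net compute time for DEDICATED variables. The function and parameter names are hypothetical. Treating P#SV@ as a fraction between 0 and 1, expressing the bandwidth in bytes per second, and the `usable_flops_per_second` parameter are assumptions made for illustration, not details taken from ENVELOPE.

```python
def load_store_time(fraction, total_bytes, register_reuse, l2_bandwidth):
    """Equation (1): time to move one class of values between cache and registers.

    fraction       -- P#SV@, share of the loaded (or stored) data in this class (0..1)
    total_bytes    -- #S, total amount of data loaded (or stored), in bytes
    register_reuse -- UR#SV@, reuse factor at the register level
    l2_bandwidth   -- UL2BANDWIDTH, in bytes/second (unit is an assumption)
    """
    return (fraction * total_bytes) / (register_reuse * l2_bandwidth)


def dedicated_compute_time(operations, usable_flops_per_second, dedicated_ls_time):
    """Compute time for DEDICATED values, net of the loads and stores.

    The load/store time is subtracted so that cycles shared between paired
    arithmetic and memory instructions are not counted twice; the result is
    clamped at zero in case the loads and stores take longer.
    """
    raw = operations / usable_flops_per_second
    return max(raw - dedicated_ls_time, 0.0)


if __name__ == "__main__":
    ls = load_store_time(0.5, 8e9, 4, 1e9)
    runtime = ls + dedicated_compute_time(2e9, 1e9, ls)
    print(f"load/store {ls:.2f} s, total {runtime:.2f} s")
```

A large reuse factor drives the load/store term toward zero, which matches the Linpack remark above about a compiler hoisting a load out of the loop.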

3.2  Scratch  Arrays 

The  next  order  of  business  is  to  account  for  the  time  spent  dealing  with  small 
scratch arrays that will normally be "locked" into one or more levels of cache.
This  is  not  assumed  to  be  the  case,  but  will  generally  be  the  case.  Using  the  same 
notation  and  terminology  as  in  subsection  3.1,  the  equations  will  be: 


$$\text{Runtime} = \text{Runtime} + \frac{\text{P\#SA@} * \text{\#S}}{\text{UR\#SA@} * \text{UL2BANDWIDTH}} \tag{2}$$


Again, one must also take into account the time spent performing calculations
that  only  involve  the  scratch  array  and  the  commonly  used  values  discussed  in 
section  3.1. 

A  complicating  factor  is  that  one  must  also  take  into  account  the  cache  misses 
associated  with  these  scratch  arrays.  ENVELOPE  assumes  that  the  number  of 
TLB  misses  associated  with  the  scratch  arrays  will  be  negligible.  If  the  working 
set  for  the  scratch  arrays  is  larger  than  the  size  of  the  outermost  level  of  cache, 
then  ENVELOPE  will  assume  a  STRIDE-1  access  pattern.  Otherwise,  ENVELOPE 
assumes  that  there  is  a  sufficient  level  of  data  reuse  at  the  cache  level  that  the  cost 
of  batching  up  the  cache  can  be  ignored.  Since  this  will  normally  be  the  case,  we 
will not discuss the equations used in the other case.

3.3  Blocked  Memory  Access  Patterns 

This  portion  of  the  code  can  either  be  used  to  estimate  the  cost  of  a  code  segment 
with a blocked memory access pattern or to estimate the cost of a code segment
involving  scratch  arrays  with  a  larger  working  set.  In  the  latter  case,  the 
estimated  size  of  the  working  set  should  be  negated  and  specified  as  the  block 
size.*  The details associated with the handling of the case where the block size
(or  second  working  set  size)  is  larger  than  the  size  of  the  cache  will  not  be 
discussed.  However,  they  are  similar  to  the  cases  discussed  in  subsections  3.4 
and  3.5. 
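
The sign convention for the block size can be summarized in a few lines. The sketch below is an illustration only: the names `AccessModel` and `blocked_access_model` are hypothetical, and treating a working set exactly equal to the cache size as fitting is an assumption.

```python
from enum import Enum


class AccessModel(Enum):
    IN_CACHE = "fits in cache"
    STRIDE_1 = "STRIDE-1"
    STRIDE_N = "STRIDE-N"


def blocked_access_model(block_size, cache_size):
    """Pick the cost model for the blocked / second-working-set category.

    A negative block_size flags "treat this as a second working set": if it
    does not fit, the cheaper STRIDE-1 pattern is assumed. A positive
    block_size that does not fit is charged as STRIDE-N.
    """
    size = abs(block_size)
    if size <= cache_size:  # assumption: an exact fit counts as fitting
        return AccessModel.IN_CACHE
    return AccessModel.STRIDE_1 if block_size < 0 else AccessModel.STRIDE_N


if __name__ == "__main__":
    cache = 8 * 1024 * 1024
    for block in (1_000_000, 32_000_000, -32_000_000):
        print(block, "->", blocked_access_model(block, cache).value)
```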

With  everything  fitting  into  cache,  one  still  has  to  be  concerned  with  the  cache 
misses  associated  with  batching  up  the  cache.  Furthermore,  unlike  subsection  3.2, 
ENVELOPE  does  not  assume  that  this  cost  can  be  ignored.  In  the  case  of  a 
blocked  access  pattern,  the  level  of  reuse  might  be  fairly  modest  (e.g.,  2-10).  Even 
in  the  case  of  a  second  level  of  working  set,  the  level  of  reuse  might  be  more 
limited.  This  part  of  the  program  is  split  into  the  following  two  cases: 

(1)  In  the  case  where  prefetching  has  been  essentially  disabled  by  setting  the 
maximum  number  of  outstanding  prefetches /cache  misses  to  1,  the  latency 
of  a  cache  miss  will  effectively  determine  the  usable  memory  bandwidth. 

(2)  In the other case, the memory bandwidth will be the limiting factor.

The  first  step  is  to  determine  the  number  of  cache  misses.  This  number  is  not 
actually  used  for  the  calculation  of  runtime  and  performance.  It  is  however  used 
to  estimate  the  number  of  bus  transactions.  Calculating  the  time  required  to 
perform  the  cache  misses  is  much  more  complicated.  Some  of  the  complicating 
factors  are  as  follows: 


*  The  negative  value  is  used  as  a  flag  when  the  working  set  is  larger  than  the  cache.  If  the  block 
size  is  positive,  then  ENVELOPE  will  estimate  the  cost  of  the  loads  and  stores  as  though  this  code 
segment  was  using  a  STRIDE-N  access  pattern  (the  most  expensive  access  pattern).  However,  if 
the  block  size  is  negative,  then  ENVELOPE  estimates  the  cost  of  the  loads  and  stores  as  though  this 
code  segment  was  using  a  STRIDE-1  access  pattern.  While  this  access  pattern  is  more  expensive 
than  living  out  of  cache,  it  is  significantly  less  expensive  than  a  STRIDE-N  access  pattern. 

†  Earlier in the program, the memory bandwidth was adjusted, when necessary, to handle the
situation  in  which  the  supported  level  of  prefetching  was  insufficient  to  fully  utilize  the  complete 
memory  bandwidth. 

‡  ENVELOPE actually supports three separate values for the memory bandwidth:

(1)  The  bandwidth  when  only  performing  loads. 

(2)  The  bandwidth  when  only  performing  stores. 

(3)  The  bandwidth  when  performing  a  balanced  mix  of  loads  and  stores. 

For  some  systems,  the  three  values  will  be  the  same.  However,  for  the  HP  PA  8XXX  series  of 
processors,  on  a  well  balanced  system,  the  third  value  will  actually  be  the  sum  of  the  first  two 
values.  As  a  result  of  this,  it  is  necessary  to  determine  the  amount  of  data  being  loaded  and  stored, 
pair  the  loads  and  stores  at  the  mixed  bandwidth  rate,  and  then  finish  up  with  any  nonpaired  loads 
or  stores.  This  assumes  that  the  code  pairs  loads  and  stores  as  much  as  possible.  This  is  not  always 
the  case,  but  is  a  good  assumption  for  codes  that  call  the  BLAS  routines,  copy  arrays,  transpose 
arrays,  and  perform  a  variety  of  other  common  operations.  However,  for  a  program  which  copies 
data into a small buffer that is "locked" into cache, pounds on the buffer, and then writes the
results  out  to  a  large  global  array,  this  can  be  a  poor  assumption.  Fortunately,  for  many  systems, 
this  discussion  is  academic,  since  the  three  values  are  identical.  In  the  remaining  situations,  it 
appears as though the maximum error is an overestimation of the performance (underestimation
of  the  run  time)  by  a  factor  of  2. 
(1)  Additional memory traffic resulting from the coherency protocols for cache
lines  that  are  stored  to. 

(2)  The  allocation  of  memory  bandwidth  to  handle  the  portion  of  the  loads  and 
stores  that  can  be  overlapped  with  each  other. 

(3)  Calculating  the  time  required  for  those  loads  or  stores  that  could  not  be 
overlapped  with  stores  or  loads  respectively. 

(4)  The  grouping  of  data  into  data  structures.  In  particular,  the  inefficiencies 
that  can  result  if  most  accesses  to  the  structure  use  less  than  100%  of  the 
data  (this  will  result  in  the  consumption  of  additional  memory  bandwidth). 

(5)  The  level  of  temporal  locality  (the  number  of  times  a  value  is  reused  prior 
to  eviction  from  the  cache). 
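
Items (2) and (3), together with the footnote on the three memory bandwidth values, describe pairing loads with stores at the mixed rate and then finishing the unpaired remainder. One way to read that procedure is sketched below. It assumes byte-for-byte pairing and a mixed bandwidth that counts load and store traffic together. Both are assumptions, and the function name is hypothetical.

```python
def memory_transfer_time(load_bytes, store_bytes, load_bw, store_bw, mixed_bw):
    """Time to move load and store traffic, pairing as much as possible.

    load_bw  -- bandwidth when only loading (bytes/second)
    store_bw -- bandwidth when only storing (bytes/second)
    mixed_bw -- combined bandwidth for a balanced mix of loads and stores
    """
    paired = min(load_bytes, store_bytes)
    # Paired phase: an equal amount of load and store traffic moves together.
    paired_time = 2 * paired / mixed_bw
    # Whatever could not be paired runs at the load-only or store-only rate.
    leftover_load_time = (load_bytes - paired) / load_bw
    leftover_store_time = (store_bytes - paired) / store_bw
    return paired_time + leftover_load_time + leftover_store_time


if __name__ == "__main__":
    # Mixed bandwidth equal to the sum of the other two, as in the HP PA 8XXX footnote.
    print(memory_transfer_time(3e9, 1e9, 1e9, 1e9, 2e9))
    # All three bandwidths identical: pairing brings no benefit.
    print(memory_transfer_time(3e9, 1e9, 1e9, 1e9, 1e9))
```

If a code does not actually interleave its loads and stores, this model underestimates the run time, which is the factor-of-2 risk noted in the footnote.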

Additionally,  the  number  and  cost  of  the  TLB  misses  must  be  accounted  for. 
Fortunately,  this  process  is  somewhat  simpler  than  it  was  for  cache  misses,  since 
ENVELOPE  assumes  that  TLB  misses  are  handled  one  at  a  time  by  the  processor 
and  cannot  be  overlapped  with  anything.  In  the  case  where  the  cost  of  the  stores 
is  expected  to  be  greater  than  the  cost  of  the  loads  and  prefetching  is  supported, 
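the equations are as follows:

$$\text{current cache misses} = \frac{\text{PLGB} * \text{LOADS}}{\text{UCLGB} * \text{L2LSIZE}} * \frac{\text{GSIZE}}{\text{GDENSITY}} \tag{3}$$

$$\text{current run time} = \frac{\text{PLGB} * \text{LOADS} * \text{RMTRANSACTIONS}}{\text{UCLGB} * \text{PUMRBANDWIDTH}} \tag{4}$$

$$\text{current TLB misses} = \left(\frac{\text{PLGB} * \text{LOADS}}{\text{UCLGB} * \text{PAGESIZE}} + \frac{\text{PSGB} * \text{STORES}}{\text{UCSGB} * \text{PAGESIZE}}\right) * \frac{\text{GSIZE}}{\text{GDENSITY}} \tag{5}$$

$$\text{TEMPBUSTRANSACTIONS} = \text{current cache misses} * \text{RMTRANSACTIONS} + \text{current TLB misses} \tag{6}$$

$$\text{TEMPRUNTIME} = \text{current run time} * \frac{\text{GSIZE}}{\text{GDENSITY}} + \text{current TLB misses} * \text{TLATENCY} * 10^{-9} \tag{7}$$

$$\text{current cache misses} = \frac{\text{PSGB} * \text{STORES}}{\text{UCSGB} * \text{L2LSIZE}} * \frac{\text{GSIZE}}{\text{GDENSITY}} \tag{8}$$

$$\text{current bus transactions} = \text{current cache misses} * \text{WMTRANSACTIONS} \tag{9}$$

current run time = [expression in PSGB * STORES, UCSGB, WMTRANSACTIONS,
UMWBANDWIDTH, PUMWBANDWIDTH, and the preceding current run time; the
fraction layout is not recoverable from the source]   (10)

$$\text{TEMPRUNTIME} = \text{TEMPRUNTIME} + \text{current run time} * \frac{\text{GSIZE}}{\text{GDENSITY}} \tag{11}$$

$$\text{total run time} = \text{total run time} + \text{TEMPRUNTIME} \tag{12}$$

$$\text{TEMPBUSTRANSACTIONS} = \text{TEMPBUSTRANSACTIONS} + \text{current bus transactions} \tag{13}$$

$$\text{total bus transactions} = \text{total bus transactions} + \text{TEMPBUSTRANSACTIONS} \tag{14}$$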
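
The load-side portion of this case, equations (3) through (7), can be sketched as follows. This is one reading of the equations rather than ENVELOPE's actual code. In particular, the sum of the two terms in equation (5) and the factor of 1e-9 that converts a TLB latency in nanoseconds to seconds are assumptions. The function name and the lower-case parameter names are hypothetical stand-ins for the upper-case symbols.

```python
def blocked_load_terms(plgb, loads, uclgb, psgb, stores, ucsgb,
                       l2lsize, pagesize, gsize, gdensity,
                       rmtransactions, pumrbandwidth, tlatency_ns):
    """Evaluate equations (3)-(7) for the blocked, in-cache, prefetching case."""
    struct_factor = gsize / gdensity  # penalty for partially used data structures

    cache_misses = plgb * loads / (uclgb * l2lsize) * struct_factor        # (3)
    run_time = plgb * loads * rmtransactions / (uclgb * pumrbandwidth)     # (4)
    tlb_misses = (plgb * loads / (uclgb * pagesize)
                  + psgb * stores / (ucsgb * pagesize)) * struct_factor    # (5)
    temp_bus_transactions = cache_misses * rmtransactions + tlb_misses     # (6)
    temp_runtime = (run_time * struct_factor
                    + tlb_misses * tlatency_ns * 1e-9)                     # (7)

    return {
        "cache_misses": cache_misses,
        "run_time": run_time,
        "tlb_misses": tlb_misses,
        "temp_bus_transactions": temp_bus_transactions,
        "temp_runtime": temp_runtime,
    }


if __name__ == "__main__":
    terms = blocked_load_terms(plgb=1.0, loads=1024, uclgb=1.0,
                               psgb=0.0, stores=0, ucsgb=1.0,
                               l2lsize=128, pagesize=1024,
                               gsize=1, gdensity=1,
                               rmtransactions=1, pumrbandwidth=1024,
                               tlatency_ns=100)
    for name, value in terms.items():
        print(name, value)
```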

Other  cases  which  are  handled  separately  but  in  a  similar  manner  are: 

(1)  Working set fits into cache, with loads taking longer than stores, with
prefetching  supported. 

(2)  Working set does not fit into cache, and a fallback to a STRIDE-N access
pattern has been specified.

(3)  Working set does not fit into cache, and a fallback to a STRIDE-1 access
pattern has been specified. In this case, all of the possible ways for
computing  a  STRIDE-1  access  pattern  for  a  data  set  not  fitting  into  cache 
must  be  handled  (see  subsection  3.5  for  details). 

(4)  Working  set  fits  into  cache,  but  prefetching  is  not  supported.  This  precludes 
the  possibility  that  cache  misses  resulting  from  loads  and  stores  can  be 
overlapped,  since  cache  misses  resulting  from  loads  cannot  even  overlap 
with  themselves.* 

3.4  STRIDE-N  Memory  Access  Patterns 

In  theory,  this  should  be  the  easiest  of  the  memory  access  patterns  to  handle.  In 
reality,  some  additional  complications  have  arisen  that  are  causing  some 
problems.  These  complications  will  be  discussed  in  greater  detail  as  this  section 
progresses.  Unlike  subsection  3.3,  there  are  only  two  cases  that  need  to  be 
considered  here: 

(1)  All  of  the  data  fits  into  cache.  In  this  case,  ENVELOPE  assigns  the  cost  of 
batching  up  the  cache  to  this  access  pattern.  The  cost  is  calculated  in  a  very 
simple  manner  designed  to  produce  the  smallest  possible  cost  (e.g.,  the 
minimum  number  of  cache  misses  and  assuming  that  prefetching  is 
supported).†  Additionally, if any calculations are specified for this access
pattern,  their  cost  will  also  be  calculated. 




*  In  some  cases,  this  might  be  too  pessimistic,  since  some  systems  might  support  the 
overlapping  of  coherency  traffic  (e.g.,  write  backs  from  cache  to  main  memory  as  part  of  an 
eviction)  with  the  handling  of  a  cache  miss.  It  is  not  clear  what  the  exact  importance  of  this  form  of 
overlapping  would  be;  therefore,  it  is  currently  treated  as  a  higher  order  effect  and  is  ignored. 

†  In some cases, this may be overly optimistic; however, it is unlikely that any jobs other than
some of the smaller benchmarks (e.g., Linpack 100 x 100) will ever fall into this case.
(2)  By far, the more common case and, therefore, the one that will be
considered  in  greater  detail  is  the  case  where  the  amount  of  data  is  too 
large  to  fit  into  the  cache  (probably  by  a  large  amount). 

In  this  section,  the  following  simplifying  assumptions  are  made  when  handling 
the  second  case: 

(1)  TLB  misses  do  not  overlap  with  anything. 

(2)  TLB  misses  have  a  fixed  cost,  with  the  page  table  entry  coming  from  main 
memory. 

(3)  Effectively,  no  form  of  overlapping  multiple  cache  misses  can  occur,  since 
each  of  the  cache  misses  will  be  associated  with  a  TLB  miss. 

(4)  The grouping of data into structures can reduce the number of TLB misses.*

(5)  Each  TLB  miss  is  expected  to  have  a  cache  miss  associated  with  it.  In 
reality,  it  is  possible  for  an  algorithm  (e.g.,  LU  decomposition)  to  loop 
through  a  few  hundred  pages  of  data  with  a  STRIDE-N  access  pattern.  In 
such  a  case,  there  can  be  significantly  more  TLB  misses  than  cache  misses. 
However, for an algorithm with a random access pattern and/or a working
set involving tens of thousands (or more) of pages of data, each TLB miss
should have a corresponding cache miss.†

Of  all  of  these  assumptions,  the  second  assumption  seems  to  be  causing  the  most 
problems.  If  the  TLB  misses  occur  in  a  cyclic  fashion,  it  is  conceivable  that  a 
processor  might  store  the  page  table  entries  in  cache.  On  many  systems,  this 
would  decrease  the  cost  of  a  TLB  miss  by  at  least  a  factor  of  2.  Calculations  based 
on  benchmarking  studies  made  for  the  NAS  LU  Benchmark  (class  B)  indicate  that 
on  many  systems,  this  is  almost  certainly  happening.  At  the  present  time,  the 
best  solution  to  dealing  with  such  a  case  is  to  significantly  decrease  the  estimated 
cost  for  TLB  misses  when  modeling  such  a  run.  It  should  be  noted  that  to  avoid 
problems with ENVELOPE's error checking and default mechanisms, the cost of
a  TLB  miss  should  be  at  least  1  nanosecond. 
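
A minimal sketch of this adjustment follows, assuming the reduction is the one suggested in section 2 (subtracting the cost of a cache miss from the normally used cost of a TLB miss). The function name is hypothetical, and the appropriate size of the reduction for a given system is something the text leaves to experimental results.

```python
MIN_TLB_LATENCY_NS = 1.0  # floor stated in the text for ENVELOPE's input checks


def adjusted_tlb_latency(tlb_miss_ns, cache_miss_ns):
    """Reduce the TLB miss cost when page table entries appear to hit in cache.

    The result is never allowed to drop below 1 nanosecond.
    """
    return max(tlb_miss_ns - cache_miss_ns, MIN_TLB_LATENCY_NS)


if __name__ == "__main__":
    # Illustrative latencies only, not measurements of any real system.
    print(adjusted_tlb_latency(500.0, 200.0))
    print(adjusted_tlb_latency(150.0, 200.0))
```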


*  For  codes  that  were  analyzed  by  hand,  this  can  be  an  important  effect.  However,  since  the 
author  expects  that  most  codes  will  be  analyzed  using  the  tool  discussed  in  section  4,  this  effect  is 
insignificant.  The  tool  discussed  in  section  4  recommends  input  to  ENVELOPE  based  on  the 
output  from  a  Perfex  run.  In  that  case,  any  benefit  from  the  grouping  of  data  into  structures  has 
already  been  accounted  for  by  a  corresponding  reduction  in  the  measured  TLB  miss  rate.  Such  a 
reduction would be expressed as a decrease in the estimated percentage of the work that is mapped
to  the  STRIDE-N  access  pattern. 

†  This is not as serious a problem as one might expect. If one assumes that most programs will
be  analyzed  by  the  tool  in  section  4,  then  since  that  tool  makes  the  same  assumption,  everything 
should  work  out.  There  might  be  some  minor  discrepancies  due  to  the  cache  misses  being  handled 
one  at  a  time;  whereas,  if  they  were  mapped  to  a  STRIDE-1  access  pattern  they  could  be  overlapped 
using  prefetching.  However,  for  a  well-tuned  code,  one  can  expect  the  number  of  TLB  misses  to  be 
significantly  smaller  than  the  number  of  cache  misses.  In  this  situation,  all  of  this  becomes  a  higher 
order  effect  that  can  be  safely  ignored. 
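The resulting equations are as follows:

\begin{align}
\text{current cache misses} &= \frac{\mathrm{PLGN} \times \mathrm{LOADS}}{\mathrm{UCLGN} \times \mathrm{DSIZE} \times \mathrm{GDENSITY}}. \tag{15} \\
\text{current TLB misses} &= \text{current cache misses}. \tag{16} \\
\text{current cache misses} &= \text{current cache misses} \times \left(1 + \frac{\mathrm{GSIZE} \times \mathrm{DSIZE}}{\mathrm{L2LSIZE}}\right). \tag{17} \\
\text{current bus transactions} &= \text{current cache misses} \times \mathrm{RMTRANSACTIONS} + \text{current TLB misses}. \tag{18} \\
\text{current run time} &= \left(\text{current cache misses} \times \mathrm{MLATENCY} \times \mathrm{RMTRANSACTIONS} + \text{current TLB misses} \times \mathrm{TLATENCY}\right) \times 10^{-9}. \tag{19} \\
\mathrm{TEMPRUNTIME} &= \text{current run time}. \tag{20} \\
\mathrm{TEMPBUSTRANSACTIONS} &= \text{current bus transactions}. \tag{21} \\
\text{current cache misses} &= \frac{\mathrm{PSGN} \times \mathrm{STORES}}{\mathrm{UCSGN} \times \mathrm{DSIZE} \times \mathrm{GDENSITY}}. \tag{22} \\
\text{current TLB misses} &= \text{current cache misses}. \tag{23} \\
\text{current cache misses} &= \text{current cache misses} \times \left(1 + \frac{\mathrm{GSIZE} \times \mathrm{DSIZE}}{\mathrm{L2LSIZE}}\right). \tag{24} \\
\text{current bus transactions} &= \text{current cache misses} \times \mathrm{WMTRANSACTIONS} + \text{current TLB misses}. \tag{25} \\
\text{current run time} &= \left(\text{current cache misses} \times \mathrm{MLATENCY} \times \mathrm{WMTRANSACTIONS} + \text{current TLB misses} \times \mathrm{TLATENCY}\right) \times 10^{-9}. \tag{26} \\
\mathrm{TEMPRUNTIME} &= \mathrm{TEMPRUNTIME} + \text{current run time}, \notag \\
\text{total run time} &= \text{total run time} + \mathrm{TEMPRUNTIME}. \tag{27} \\
\mathrm{TEMPBUSTRANSACTIONS} &= \mathrm{TEMPBUSTRANSACTIONS} + \text{current bus transactions}. \tag{28} \\
\text{total bus transactions} &= \text{total bus transactions} + \mathrm{TEMPBUSTRANSACTIONS}. \tag{29} \\
\mathrm{TEMPRUNTIME} &= \frac{2.0 \times \mathrm{PGNMADDS} \times \mathrm{NMADDS}}{\mathrm{UMADDS}} + \frac{\mathrm{PGNMULTIPLIES} \times \mathrm{NMULTIPLIES}}{\mathrm{UMULTIPLIES}}. \tag{30} \\
\mathrm{TEMPRUNTIME} &= \mathrm{TEMPRUNTIME} + \frac{\mathrm{PGNADDS} \times \mathrm{NADDS}}{\mathrm{UADDS}} + \frac{\mathrm{PGNIOPS} \times \mathrm{NIOPS}}{\mathrm{UIOPS}}. \tag{31} \\
\mathrm{TEMPRUNTIME} &= \max\left(\mathrm{TEMPRUNTIME},\ \frac{\mathrm{PLGN} \times \mathrm{LOADS} + \mathrm{PSGN} \times \mathrm{STORES}}{\mathrm{UL2BANDWIDTH}}\right). \tag{32} \\
\text{total run time} &= \text{total run time} + \mathrm{TEMPRUNTIME}. \tag{33}
\end{align}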
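
As a rough illustration, the STRIDE-N bookkeeping of equations (15)-(33) can be sketched in Python. This is a minimal sketch under stated assumptions, not ENVELOPE's actual source: the function and argument names are hypothetical, the per-group term is read as GSIZE * DSIZE / L2LSIZE, and the latencies are assumed to be in nanoseconds because of the 10^-9 factor.

```python
def stride_n_memory_cost(ops, pct, use, dsize, gsize, gdensity,
                         l2lsize, mlatency, tlatency, transactions):
    """Memory cost of the loads (eqs. 15-19) or of the stores (eqs. 22-26).

    ops is LOADS or STORES, pct is PLGN or PSGN, use is UCLGN or UCSGN, and
    transactions is RMTRANSACTIONS or WMTRANSACTIONS. Latencies are assumed
    to be in nanoseconds, so the returned run time is in seconds.
    """
    cache_misses = pct * ops / (use * dsize * gdensity)   # eq. 15 or 22
    tlb_misses = cache_misses                             # eq. 16 or 23
    cache_misses *= 1.0 + gsize * dsize / l2lsize         # eq. 17 or 24
    bus = cache_misses * transactions + tlb_misses        # eq. 18 or 25
    run_time = (cache_misses * mlatency * transactions
                + tlb_misses * tlatency) * 1e-9           # eq. 19 or 26
    return run_time, bus


def stride_n_compute_time(madds, mults, adds, iops, traffic, ul2bandwidth):
    """Instruction cost with the L2 bandwidth floor (eqs. 30-32).

    madds, mults, adds, and iops are (percentage, count, rate) triples such
    as (PGNMADDS, NMADDS, UMADDS); traffic is PLGN*LOADS + PSGN*STORES.
    """
    def cost(pct, count, rate):
        return pct * count / rate

    t = 2.0 * cost(*madds) + cost(*mults)                 # eq. 30
    t += cost(*adds) + cost(*iops)                        # eq. 31
    return max(t, traffic / ul2bandwidth)                 # eq. 32
```

Calling `stride_n_memory_cost` once with the load inputs and once with the store inputs and summing the two results gives the memory contribution to the total run time and bus transactions; the value returned by `stride_n_compute_time` is then added to the total run time, as in equation (33).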


3.5  STRIDE-1  Memory  Access  Patterns 

Once again, the STRIDE-1 memory access pattern requires the handling of the
following special cases:

(1)  The  data  set  being  small  enough  to  fit  entirely  in  cache.  Again,  this  case  is 
primarily  there  to  support  some  small  benchmarks  (e.g.,  Linpack  100  x 
100). As seen in subsection 3.4, the cost of the cache and TLB misses has
already been accounted for. Therefore, all that remains is the
straightforward  handling  of  the  cost  of  the  instructions  themselves.  This  is 
done  in  a  manner  that  is  very  similar  to  the  last  four  equations  in 
subsection  3.4. 

(2)  A  STRIDE-1  access  pattern  with  prefetching  disabled.  This  implies  that  the 
cache  misses  do  not  overlap.  Therefore,  whether  the  loads  or  stores  take 
longer  to  complete  is  not  a  concern. 

(3) The cases of a STRIDE-1 access pattern with prefetching enabled. The
relative costs of the loads and stores are now a concern. While these two
cases  must  be  handled  separately,  the  resulting  equations  both  look  very 
similar  to  those  used  in  subsection  3.3.  Therefore,  they  will  not  be  repeated 
here. 

What  will  be  looked  at  here  is  the  second  case.  The  resulting  equations  are  as 
follows: 


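\begin{align}
\text{current cache misses} &= \frac{\mathrm{PLG1} \times \mathrm{LOADS}}{\mathrm{UCLG1} \times \mathrm{L2LSIZE}} \times \frac{\mathrm{GSIZE}}{\mathrm{GDENSITY}}. \tag{34} \\
\text{current TLB misses} &= \frac{\mathrm{PLG1} \times \mathrm{LOADS}}{\mathrm{UCLG1} \times \mathrm{PAGESIZE}} \times \frac{\mathrm{GSIZE}}{\mathrm{GDENSITY}}. \tag{35} \\
\text{current bus transactions} &= \text{current cache misses} \times \mathrm{RMTRANSACTIONS} + \text{current TLB misses}. \tag{36} \\
\text{current run time} &= \left(\text{current cache misses} \times \mathrm{MLATENCY} \times \mathrm{RMTRANSACTIONS} + \text{current TLB misses} \times \mathrm{TLATENCY}\right) \times 10^{-9}. \tag{37} \\
\mathrm{TEMPRUNTIME} &= \text{current run time}. \tag{38} \\
\mathrm{TEMPBUSTRANSACTIONS} &= \text{current bus transactions}. \tag{39} \\
\text{current cache misses} &= \frac{\mathrm{PSG1} \times \mathrm{STORES}}{\mathrm{UCSG1} \times \mathrm{L2LSIZE}} \times \frac{\mathrm{GSIZE}}{\mathrm{GDENSITY}}. \tag{40} \\
\text{current TLB misses} &= \frac{\mathrm{PSG1} \times \mathrm{STORES}}{\mathrm{UCSG1} \times \mathrm{PAGESIZE}} \times \frac{\mathrm{GSIZE}}{\mathrm{GDENSITY}}. \tag{41} \\
\text{current bus transactions} &= \text{current cache misses} \times \mathrm{WMTRANSACTIONS} + \text{current TLB misses}. \tag{42} \\
\text{current run time} &= \left(\text{current cache misses} \times \mathrm{MLATENCY} \times \mathrm{WMTRANSACTIONS} + \text{current TLB misses} \times \mathrm{TLATENCY}\right) \times 10^{-9}. \tag{43} \\
\mathrm{TEMPRUNTIME} &= \mathrm{TEMPRUNTIME} + \text{current run time}. \tag{44} \\
\text{total run time} &= \text{total run time} + \mathrm{TEMPRUNTIME}. \tag{45} \\
\mathrm{TEMPBUSTRANSACTIONS} &= \mathrm{TEMPBUSTRANSACTIONS} + \text{current bus transactions}. \tag{46} \\
\text{total bus transactions} &= \text{total bus transactions} + \mathrm{TEMPBUSTRANSACTIONS}. \tag{47} \\
\mathrm{TEMPRUNTIME} &= \frac{2.0 \times \mathrm{PG1MADDS} \times \mathrm{NMADDS}}{\mathrm{UMADDS}} + \frac{\mathrm{PG1MULTIPLIES} \times \mathrm{NMULTIPLIES}}{\mathrm{UMULTIPLIES}}. \tag{48} \\
\mathrm{TEMPRUNTIME} &= \mathrm{TEMPRUNTIME} + \frac{\mathrm{PG1IOPS} \times \mathrm{NIOPS}}{\mathrm{UIOPS}} + \frac{\mathrm{PG1ADDS} \times \mathrm{NADDS}}{\mathrm{UADDS}}. \tag{49} \\
\mathrm{TEMPRUNTIME} &= \max\left(\mathrm{TEMPRUNTIME},\ \frac{\mathrm{PLG1} \times \mathrm{LOADS} + \mathrm{PSG1} \times \mathrm{STORES}}{\mathrm{UL2BANDWIDTH}}\right). \tag{50} \\
\text{total run time} &= \text{total run time} + \mathrm{TEMPRUNTIME}. \tag{51}
\end{align}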
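
The memory part of the no-prefetch STRIDE-1 case, equations (34)-(47), can be sketched as follows. This is an illustrative sketch only: the function names and the keyword-argument dictionaries are hypothetical, and the latencies are again assumed to be in nanoseconds. Because prefetching is disabled, the load and store costs are simply added rather than overlapped.

```python
def stride1_no_prefetch_cost(ops, pct, use, gsize, gdensity, l2lsize,
                             pagesize, mlatency, tlatency, transactions):
    """Loads use eqs. 34-37; stores use eqs. 40-43 with the store inputs."""
    touched = pct * ops / use * gsize / gdensity
    cache_misses = touched / l2lsize                    # eq. 34 or 40
    tlb_misses = touched / pagesize                     # eq. 35 or 41
    bus = cache_misses * transactions + tlb_misses      # eq. 36 or 42
    run_time = (cache_misses * mlatency * transactions
                + tlb_misses * tlatency) * 1e-9         # eq. 37 or 43
    return run_time, bus


def stride1_memory_totals(load_args, store_args):
    """Sum the load and store contributions (eqs. 38-39, 44, and 46)."""
    load_time, load_bus = stride1_no_prefetch_cost(**load_args)
    store_time, store_bus = stride1_no_prefetch_cost(**store_args)
    return load_time + store_time, load_bus + store_bus
```

The two returned values correspond to TEMPRUNTIME and TEMPBUSTRANSACTIONS just before they are added to the running totals in equations (45) and (47).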


4.  Associated  Tools 


Unfortunately,  most  people  will  find  it  difficult,  if  not  impossible,  to  analyze  the 
usage  patterns  of  most  programs  with  sufficient  detail  for  use  with  ENVELOPE. 
In  an  attempt  to  solve  this  problem,  we  have  written  a  second  program  which 
prompts for information from an instrumented run (e.g., Perfex on an SGI system
and/or the Hardware Performance Monitor on Cray vector systems). Based on a
modest  number  of  questions,  it  will  solve  a  series  of  equations  and  supply  a  set 
of  numbers  for  use  with  ENVELOPE.  Unlike  ENVELOPE,  some  assumptions  and 
heuristics are used in this tool. As a result, its output may not be unique and
probably will not exactly match what would be produced by a detailed analysis
of  the  user's  program.  However,  the  results  should  be  sufficient  to  allow 
ENVELOPE  to  accurately  predict  many  aspects  of  the  performance  of  the  user's 
program  (e.g.,  run  time  and  performance  in  terms  of  MFLOPS).  Table  4  shows  a 
sample  run  of  this  program. 


5.  The  Equations  for  the  Associated  Tools 


This  section  will  discuss  some  of  the  equations  used  by  the  tool  which  uses  data 
from Perfex (or similar programs/libraries, e.g., PAPI) to simplify the job of
creating  an  input  file  for  ENVELOPE.  This  tool  contains  two  parts.  The  first  part 
is  optional,  and  when  used,  will  use  one  of  two  approaches  (depending  on  the 
available  input  data)  to  estimate  the  number  of  floating  point  adds,  multiplies, 
and  multiply-add  instructions  that  are  executed  during  a  run.  It  should  be  noted 
that  each  of  these  instructions  actually  represents  a  group  of  instructions  (e.g., 
"adds"  includes  adds,  subtracts,  compares,  as  well  as  other  less  frequently  used 
instructions such as convert, int, abs, etc.). Subsections 5.1-5.3 will discuss this
part  of  the  tool  in  greater  detail.* 

The  second  part  of  the  tool  calculates  (or,  in  a  few  cases,  provides  crude  estimates 
based  on  rules  of  thumb)  most  of  the  remaining  inputs  needed  to  describe  the 
software  to  ENVELOPE.  This  part  of  the  tool  will  be  discussed  in  subsections 
5.4-5.8. 

5.1  Conventions  Used  in  Subsections  5.2  and  5.3 

The  following  conventions  will  be  used  in  subsections  5.2  and  5.3  to  simplify  the 
equations. 


*  One  important  point  to  remember  is  that  some  compilers  will  only  produce  independent 
multiply  and  add  instructions,  other  compilers  will  preferentially  produce  multiply-add 
instructions,  and  a  few  will  produce  a  mix  based  on  some  optimization  criteria.  Furthermore,  for 
some  hardware,  this  will  make  little  if  any  difference  in  the  performance.  However,  for  other 
systems,  there  might  be  a  significant  difference  in  performance  (e.g.,  up  to  a  factor  of  2).  Therefore, 
in  cases  where  the  tool  estimates  that  a  large  number  of  independent  multiply  and  add  instructions 
are  being  used,  one  might  want  to  calculate  the  performance  based  on  both  that  set  of  numbers  and 
on  the  assumption  that  the  hardware  is  executing  the  instructions  as  though  they  were  chained 
multiply-add  instructions.  Fortunately,  in  most  cases,  factors  such  as  the  amount  of  time  spent  on 
cache misses and the ratio between memory operations (loads and stores) vs. floating point
operations  may  eliminate  most  of  the  potential  difference  in  performance. 
NADDS = number of floating point add instructions.

NFINST  =  number  of  floating  point  instructions. 

NFLOPS  =  number  of  floating  point  operations. 

NMADDS  =  number  of  floating  point  multiply-add  instructions. 
NMULTIPLIES  =  number  of  floating  point  multiply  instructions. 
NRECIPROCALS = number of reciprocal approximation instructions.
">" = greater than.

"≥" = greater than or equal to.

"<" = less than.

"≤" = less than or equal to.


5.2  Estimating  the  Numbers  and  Types  of  Floating  Point  Instructions 
Using  a  Combination  of  a  priori  Data  and  Data  From  Perfex 

If one has access to a count of the number of floating point operations for a run,
as is frequently the case for industry standard benchmarks, then one can use that
information in conjunction with the floating point instruction count from Perfex
to estimate the number of times each of the three classes of instructions (adds,
multiplies, multiply-adds) is executed during a run. Alternatively, by comparing
the floating point instruction count from two runs (one compiled with the use of
multiply-adds enabled and one compiled with their use disabled), one can also
use the following equations:

If NFINST ≥ NFLOPS, then

NMADDS = 0.0,

NADDS = 0.5 * NFLOPS,

and

NMULTIPLIES = NADDS.  (52)

Otherwise, if NFINST ≤ 0.5 * NFLOPS, then

NMADDS = 0.5 * NFLOPS,

NADDS = 0.0,

and

NMULTIPLIES = 0.0.  (53)

Otherwise,

NMADDS = NFLOPS - NFINST,

NADDS = NFINST - NMADDS,

and

NMULTIPLIES = 0.0.*  (54)


It  is  important  to  note  that  the  equations  being  used  have  been  made  more  robust 
by  eliminating  the  assumption  that  the  compiler  produced  a  floating  point 
operation  count  that  was  identical  to  that  produced  by  a  priori  knowledge.  In 
some  cases,  an  optimizing  compiler  might  do  slightly  better.  In  other  cases,  an 
optimizing  compiler  might  even  add  floating  point  operations  if  it  thought  that 
the  efficient  use  of  multiply-add  instructions  would  improve  the  overall 
performance  of  the  code.  By  not  relying  on  this  assumption,  one  can  be  certain 
that  none  of  the  operation  counts  will  ever  be  negative  or  exceed  the  specified 
number  of  floating  point  operations. 
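
The three cases of equations (52)-(54) can be sketched as a short Python function. The function name and the returned tuple layout are illustrative assumptions; the branch conditions follow the reasoning that a multiply-add counts as one instruction but two floating point operations.

```python
def classify_fp_instructions(nfinst, nflops):
    """Estimate (NMADDS, NADDS, NMULTIPLIES) from the instruction count
    NFINST and the operation count NFLOPS, following eqs. (52)-(54)."""
    if nfinst >= nflops:
        # No room for multiply-adds: split the operations evenly (eq. 52).
        nmadds = 0.0
        nadds = 0.5 * nflops
        nmultiplies = nadds
    elif nfinst <= 0.5 * nflops:
        # Every instruction must be a multiply-add (eq. 53).
        nmadds = 0.5 * nflops
        nadds = 0.0
        nmultiplies = 0.0
    else:
        # Mixed case: the surplus operations come from multiply-adds and
        # the remaining instructions are charged to the adds (eq. 54).
        nmadds = nflops - nfinst
        nadds = nfinst - nmadds
        nmultiplies = 0.0
    return nmadds, nadds, nmultiplies
```

For example, 75 instructions performing 100 operations yield 25 multiply-adds and 50 adds, which accounts for 2 * 25 + 50 = 100 operations in 75 instructions.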

5.3  Estimating  the  Numbers  and  Types  of  Floating  Point  Instructions 
Using  a  Combination  of  Data  From  HPM  and  Perfex 

If  one  has  access  to  both  the  output  from  the  hardware  performance  monitor  on  a 
Cray  Research  vector  processor  (e.g.,  C90)  and  the  floating  point  instruction 
count  from  Perfex,  one  can  estimate  the  number  of  times  each  of  the  three  classes 
of  instructions  (adds,  multiplies,  multiply-adds)  is  executed  during  a  run.  In  this 
case,  the  reciprocal  approximation  instruction  from  the  Cray  vector  processor 
will  be  lumped  in  with  the  multiply  instructions.  The  rationale  for  this  is  to  treat 
the  combination  of  the  reciprocal  approximation  with  the  additional  refinement 
step  as  a  divide  instruction.  ENVELOPE  suggests  that  divides  be  included  with 
the  multiply  instructions  with  an  appropriate  weighting  factor.  Effectively,  this  is 
what  we  are  doing  in  the  following  equations: 

If  NADDS  +  NMULTIPLIES  +  NRECIPROCALS  <  NFINST,  then 

NMULTIPLIES  =  NMULTIPLIES  +  NRECIPROCALS, 

NMADDS  =  0.0,  (55) 

and 

NADDS  remains  unchanged. 

* For most of today's processors, the cost of a floating point add is the same as the cost of a
floating point multiply. Therefore, assigning all of the excess floating point instructions to the
floating point adds will not affect the results produced by ENVELOPE. However, on some older
processors such as the MIPS R4000/R4400 and the KSR1, this assumption is no longer valid. On
these machines, there may be a difference in the estimated performance, and one might want to
determine what the bounds on this performance are by running ENVELOPE twice: once with all
of the excess instructions classified as floating point adds and the second time with all of the excess
instructions classified as floating point multiplies.

Otherwise, if NADDS + NMULTIPLIES + NRECIPROCALS ≤ 0.5 * NFINST,
then

NMADDS = minimum of (NADDS or NMULTIPLIES),

NADDS = NADDS - NMADDS,

and

NMULTIPLIES = NMULTIPLIES + NRECIPROCALS - NMADDS.  (56)

Otherwise, if NADDS < NMULTIPLIES, then

NMADDS  =  minimum  of  (NRECIPROCALS  +  NADDS  +  NMULTIPLIES  - 
NFINST  or  NADDS), 

NADDS  =  NADDS  -  NMADDS, 

and 

NMULTIPLIES  =  NMULTIPLIES  +  NRECIPROCALS  -  NMADDS.  (57) 
Otherwise, 

NMADDS  =  minimum  of  (NRECIPROCALS  +  NADDS  +  NMULTIPLIES  - 
NFINST  or  NMULTIPLIES), 

NADDS  =  NADDS  -  NMADDS, 

and 

NMULTIPLIES  =  NMULTIPLIES  +  NRECIPROCALS  -  NMADDS.  (58) 

Again, one can see that we were careful to handle the situations where the
numbers do not add up. However, to the extent that numbers do add up, we
make the assumption that programs take advantage of chained multiplies and
adds on the Cray vector processors. Therefore, these chained operations should
be translated into multiply-add instructions for the purposes of running
ENVELOPE.

5.4  Conventions  and  Approximations  Used  in  Subsections  5.5-5.8 
In subsections 5.5-5.8, the following approximations are made:

(1)  Since  Perfex  is  being  used  to  return  data  on  a  complete  run,  the  data  is  not 
broken  down  by  subroutine,  let  alone  memory  access  pattern.  Therefore, 
for  each  type  of  memory  access  pattern  (e.g.,  STRIDE-1),  this  tool  assumes 
that the same percentage of loads as stores are used in a section. This does
not mean that the STRIDE-1 access pattern has both a billion loads and a
billion stores. Rather, it means that if 5% of the total loads exhibit the
STRIDE-1 access pattern, we will assume that 5% of the total stores will also
exhibit  this  access  pattern. 

(2)  For  the  same  reasons  as  in  (1),  we  will  assume  that  the  same  percentage  of 
multiply-adds,  adds,  multiplies,  and  integer  operations  unrelated  to 
address  calculations  (normally,  the  number  of  these  calculations  will  be 
zeroed  out  for  floating  point  intensive  applications)  are  used  for  each  type 
of  memory  access  pattern.  Furthermore,  we  will  assume  that  this 
percentage  is  the  same  as  that  calculated  in  (1). 
(3) In accordance with the way ENVELOPE is set up, we will assume that each
TLB miss has a cache miss associated with it. This need not be the case if
the cache line is still in the cache, as can happen with a STRIDE-N access
pattern executed on a cyclic basis. However, if the number of data items
is  too  large  or  if  the  access  pattern  is  actually  fairly  random,  then  this 
assumption  is  correct.  In  either  case,  since  Perfex  does  not  provide  a  way  to 
distinguish  between  the  two  cases,  both  ENVELOPE  and  this  tool  have 
been  set  up  to  function  in  this  manner.  Therefore,  this  should  not  result  in 
any  problems. 

(4) The recommended values for the group size and group density are 1 unless
known otherwise. Even then, one might want to use the value 1, since
the consequences of the grouping of data were already factored into the
output of Perfex and would be difficult to compensate for at this point.

(5)  The  number  of  integer  loads  and  stores  is  negligible  for  a  floating  point 
intensive  code.  So  the  time  required  for  them  can  be  ignored. 

(6)  Data  reuse  at  the  register  level  has  already  been  factored  in  by  the  compiler, 
having  eliminated  the  loads  and  stores  at  compile  time.  Therefore,  there 
were no "theoretical" loads and stores to be accounted for by the
*SVDEDICATED  and  *SVGENERAL  input  parameters.  The  recommended 
output  values  will  then  be  1.0  for  the  "Used"  values  indicating  no  data 
reuse  and  0.0  for  the  "percentage"  values  indicating  no  work  is  attributed 
to  this  access  pattern. 

In  subsections  5.5-5.8,  the  following  conventions  will  be  used: 

ADJMEM  =  the  adjusted  number  of  memory  operations. 

C1L1 = the number of L1 cache misses attributed to anything other than a
STRIDE-N (or random) access pattern. This can result from either a STRIDE-1
access pattern or a second larger working set that fits into the L2 but not the L1
cache.

C1L2 = the number of L2 cache misses attributed to anything other than a
STRIDE-N (or random) access pattern. Primarily, this is expected to be the
result of a STRIDE-1 access pattern.

CGBL1 = the L1 cache misses associated with a larger working set (e.g., from a
blocked access pattern involving large "global" arrays).

CN = the number of cache misses attributed to a STRIDE-N (or random) access
pattern.

DSIZE = the size of the data item in bytes (usually 8).

DPERL1 = L1LSIZE/DSIZE = the number of data elements per L1 cache line.

DPERL2 = L2LSIZE/DSIZE = the number of data elements per L2 cache line.

L1LSIZE = the size of the cache line in the L1 cache.

L1MISS = the number of L1 cache misses.

L1PERL2 = L2LSIZE/L1LSIZE = the number of L1 cache lines per L2 cache line.

L2LSIZE = the size of the cache line in the L2 cache.

L2MISS = the number of L2 cache misses.

L2PERPAGE  =  PAGESIZE/L2LSIZE  =  the  number  of  L2  cache  lines  per  page. 

NLOADS  =  the  number  of  LOADS  that  graduated  (completed). 

NSTORES  =  the  number  of  STORES  that  graduated. 

PAGESIZE  =  the  page  size  in  bytes. 

PERG1 = The percentage of the memory operations with a STRIDE-1 access
pattern.  Initially,  this  value  will  be  set  to  0.0. 

PERGB  =  The  percentage  of  the  memory  operations  associated  with  the  blocked 
memory  access  pattern  associated  with  the  larger  working  set. 

PERGN  =  The  percentage  of  the  memory  operations  with  a  STRIDE-N  access 
pattern. 

PERSA  =  The  percentage  of  the  memory  operations  associated  with  the  use  of 
small  scratch  arrays  in  the  smaller  working  set. 

TLBMISS  =  the  number  of  TLB  misses. 

USEG1 = The data use/reuse for the STRIDE-1 access pattern. Initially, this value
will be set to 1.0, indicating no reuse. However, if the value for PERG1 changes
from its initial value of 0.0, this number will be recalculated.

USEGB = The data use/reuse associated with the blocked memory access pattern
associated with the larger working set.

USEGN = The data use/reuse for the STRIDE-N access pattern. This will always
be hardwired as 1.0, indicating no reuse. This does not really matter since any
reuse that does occur will simply be charged to another access pattern in a
manner that does not result in additional TLB or cache misses.


5.5  Solving  for  the  STRIDE-N  Access  Pattern  Parameters 


The  next  stage  of  the  process  is  to  solve  for  the  STRIDE-N  access  parameters, 
since  that  will  tell  us  how  many  L2  misses  remain  to  be  allocated  among  the 
remaining access patterns. Again, reasonable checks will be made and, if
necessary, the values will be adjusted accordingly. The need for this can have
any number of sources (e.g., extraneous cache and TLB misses due to
timesharing a processor or alternatively process migration on a shared memory
SMP). The tool will now solve the following system of equations:


L2MISS = C1L2 + CN.  (59)

\[ \mathrm{TLBMISS} = \frac{\mathrm{C1L2}}{\mathrm{L2PERPAGE}} + \mathrm{CN}. \tag{60} \]

Solving for C1L2 and CN, one comes up with the following equations:

\[ \mathrm{C1L2} = \frac{\mathrm{L2MISS} - \mathrm{TLBMISS}}{1 - \frac{1}{\mathrm{L2PERPAGE}}}. \tag{61} \]

CN = L2MISS - C1L2.  (62)

Performing the sanity checks, one ends up with

If C1L2 < 0.0, then

C1L2 = 0.0 and

CN = L2MISS.  (63)

Otherwise, if CN < 0.0, then

C1L2 = L2MISS and

CN = 0.0.  (64)

ADJMEM = NLOADS + NSTORES.  (65)

\[ \mathrm{PERGN} = \frac{\mathrm{CN}}{\mathrm{ADJMEM}}. \tag{66} \]

ADJMEM = ADJMEM - CN.  (67)

C1L1 = L1MISS - CN.  (68)

CGBL1 = C1L1 - C1L2 * L1PERL2.  (69)
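
The steps of this subsection can be sketched in Python as follows. This is a minimal sketch, not the tool's actual code: the function name and the returned dictionary are hypothetical, and the expression for C1L2 is the one obtained by eliminating CN from equations (59) and (60), which assumes L2PERPAGE is greater than 1.

```python
def solve_stride_n(l2miss, tlbmiss, l1miss, nloads, nstores,
                   l2perpage, l1perl2):
    """Split the measured misses between STRIDE-N and everything else."""
    # Subtracting eq. (60) from eq. (59) eliminates CN.
    c1l2 = (l2miss - tlbmiss) / (1.0 - 1.0 / l2perpage)   # eq. 61
    cn = l2miss - c1l2                                    # eq. 62

    # Sanity checks for counts that do not add up (eqs. 63-64).
    if c1l2 < 0.0:
        c1l2, cn = 0.0, l2miss
    elif cn < 0.0:
        c1l2, cn = l2miss, 0.0

    adjmem = nloads + nstores                             # eq. 65
    pergn = cn / adjmem                                   # eq. 66
    adjmem -= cn                                          # eq. 67
    c1l1 = l1miss - cn                                    # eq. 68
    cgbl1 = c1l1 - c1l2 * l1perl2                         # eq. 69
    return {"C1L2": c1l2, "CN": cn, "PERGN": pergn,
            "ADJMEM": adjmem, "C1L1": c1l1, "CGBL1": cgbl1}
```

As a quick check, 320 non-STRIDE-N L2 misses and 100 STRIDE-N misses with 32 L2 lines per page correspond to L2MISS = 420 and TLBMISS = 110, and the function recovers C1L2 = 320 and CN = 100 from those two measurements.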

5.6  Solving  for  the  Blocked  Access  Pattern  Parameters 

Now that the STRIDE-N access pattern has been accounted for and the number
of memory operations and cache misses that remain to be accounted for is
known, we proceed to the question of the existence of a large working set which
will live out of the L2 cache. The heuristic that will be used at this point is
somewhat arbitrary but is based on the concept that unless there is a reasonable
amount of data reuse, one cannot say that a working set exists.

If C1L1 / (L1PERL2 * C1L2) > 4.0, then we have a large working set, and the following
equations are used:

PERG1 = 0.0.  (70)

USEG1 = 1.0.  (71)

This  implies  that  all  of  the  remaining  L2  cache  misses  will  be  charged  to  the 
blocked  access  pattern,  with  no  work  assigned  to  a  STRIDE-1  access  pattern.  This 
is  a  somewhat  arbitrary  assignment.  However,  since  both  access  patterns  will 
produce  the  same  ratio  between  TLB  misses  and  L2  cache  misses,  this  should  not 
be  a  problem  as  long  as  the  working  set  fits  into  the  cache.  If  one  moves  onto  a 
system  that  lacks  an  L2  cache  or  where  the  cache  is  too  small,  then  one  needs  to 
specify  if  ENVELOPE  is  to  treat  the  resulting  access  pattern  as  if  it  is  STRIDE-1  or 
STRIDE-N. The recommended default when using this tool is STRIDE-1, which is
specified  by  negating  the  estimated  size  of  the  working  set  to  be  discussed  in 
more  detail  in  section  5.8. 
The tool now assumes that all of the L1 cache misses are the result of this larger
working set, since if a smaller working set also exists, it will live out of the L1
cache. As such, the smaller working set is expected to have a negligible cache
miss  rate.  This  corresponds  to  the  use  of  small  scratch  arrays  (the  SA  input 
parameters  for  ENVELOPE). 


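ADJMEM = ADJMEM - C1L1 * DPERL1.  (72)

If ADJMEM < 0.0, then

ADJMEM = 0.0.  (73)

\[ \mathrm{PERGB} = \frac{\mathrm{C1L1} \times \mathrm{DPERL1}}{\mathrm{NLOADS} + \mathrm{NSTORES}}. \tag{74} \]

\[ \mathrm{USEGB} = \frac{\mathrm{C1L1}}{\mathrm{L1PERL2} \times \mathrm{C1L2}}. \tag{75} \]

\[ \mathrm{PERSA} = \frac{\mathrm{ADJMEM}}{\mathrm{NLOADS} + \mathrm{NSTORES}}. \tag{76} \]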
The recommended size for the large working set is 1.0 MB, with a STRIDE-1
access  pattern  if  the  cache  is  too  small.  The  choice  of  1.0  MB  is  somewhat 
arbitrary  but  is  based  on  most  SGI  systems  in  recent  years  using  a  L2  cache  size 
of  1-8  MB.  Furthermore,  most  of  the  competing  systems,  when  equipped  with  a 
large  cache,  also  have  a  cache  size  of  at  least  1  MB.  However,  experience  has 
indicated  that  some  of  the  NAS  benchmarks  have  a  large  working  set  small 
enough  to  fit  in  the  caches  of  the  Cray  T3E  and  the  IBM  SP  with  Power  2  Super 
Chips.  Therefore,  prudence  dictates  that  one  might  want  to  compare  the 
predicted  performance  to  the  measured  performance  on  one  of  these  systems  in 
an  attempt  to  fine-tune  this  parameter.  All  we  know  for  certain  is  that  the  size  of 
the larger working set is somewhere between the sizes of the L1 and L2 caches.
This  concludes  the  handling  of  the  situation  in  which  a  large  working  set  occurs. 
Subsections  5.7  and  5.8  only  apply  to  the  situation  where  a  working  set  is  either 
missing  or  not  very  effective. 

5.7  Checking  for  the  Case  of  a  Small  Working  Set  Without  a  Large 
Working  Set 

The  tool  starts  out  by  setting  the  parameters  that  describe  the  larger  working  set, 
causing  that  access  pattern  to  be  skipped  by  ENVELOPE. 

PERGB  =  0.0.  (77) 

USEGB  =  1.0.  (78) 

Once  again,  the  tool  uses  a  heuristic  to  check  to  see  if  a  small  working  set  exists 
and  is  effective. 
If (…) / (DPERL1 * CGBL1) > 4.0, then we have a small working set. This means that
there is a cache-resident small scratch array. The tool now calculates what
percentage of the memory operations involve this small working set and what
percentage needs to be mapped to the STRIDE-1 access pattern to achieve the
correct number of L2 cache misses and TLB misses. It should be noted that the
small working set is assumed to have an insignificant number of L2 cache misses
and TLB misses associated with it. Since the STRIDE-1 access pattern is being
used only to the extent necessary to account for the L2 cache misses and the TLB
misses, it will be assigned a data use/reuse value of 1.0, indicating that all data
reuse is associated with the small working set.


\[ \mathrm{PERSA} = \frac{\mathrm{ADJMEM} - \mathrm{C1L2} \times \mathrm{DPERL2}}{\mathrm{NLOADS} + \mathrm{NSTORES}}. \tag{79} \]

If PERSA > 1.0, then

PERSA = 1.0.  (80)

PERG1 = 1.0 - PERSA - PERGN.  (81)

If PERG1 > 1.0, then

PERG1 = 1.0.  (82)

If PERSA < 0.0, then

PERSA = 0.0.  (83)

USEG1 = 1.0.  (84)
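
Equations (79)-(84) amount to a split of the remaining memory operations followed by range checks, which can be sketched as follows. The function name is hypothetical, and the checks are applied in the same order as the equations are listed, so PERG1 is computed before PERSA is clamped at zero.

```python
def small_working_set_split(adjmem, c1l2, dperl2, nloads, nstores, pergn):
    """Return (PERSA, PERG1, USEG1) for the small-working-set case."""
    persa = (adjmem - c1l2 * dperl2) / (nloads + nstores)   # eq. 79
    if persa > 1.0:
        persa = 1.0                                         # eq. 80
    perg1 = 1.0 - persa - pergn                             # eq. 81
    if perg1 > 1.0:
        perg1 = 1.0                                         # eq. 82
    if persa < 0.0:
        persa = 0.0                                         # eq. 83
    useg1 = 1.0                                             # eq. 84
    return persa, perg1, useg1
```

With 9900 remaining memory operations out of 10000, 320 non-STRIDE-N L2 misses, 16 data elements per L2 line, and PERGN = 0.01, the sketch assigns 47.8% of the work to the small scratch arrays and 51.2% to the STRIDE-1 pattern.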

The  last  remaining  value  is  the  estimated  size  of  the  smaller  working  set.  All  we 
know for certain is that it has to fit into the 32 kB cache of the MIPS R10K or
R12K  processor  of  the  system  that  has  been  used.  It  probably  is  somewhat 
smaller  than  that,  so  the  tool  recommends  the  value  of  12  kB,  which  is  a  safe 
number  for  almost  all  of  the  RISC  processors  made  since  1990.  The  tool  has  now 
completed  its  task,  and  subsection  5.8  should  be  skipped. 


5.8  Handling  the  Case  Where  No  Working  Sets  Exist 

The  tool  has  now  determined  that  no  working  sets  exist,  so  all  of  the  data  access 
must  be  mapped  to  either  a  STRIDE-N  access  pattern  or  a  STRIDE-1  access 
pattern.  In  subsection  5.5,  the  portion  mapped  to  the  STRIDE-N  access  pattern 
was  calculated,  leaving  the  STRIDE-1  access  pattern  to  be  handled  now. 

PERSA = 0.0.  (85)

PERG1 = 1.0 - PERGN.  (86)

\[ \mathrm{USEG1} = \frac{\mathrm{ADJMEM}}{\mathrm{DPERL2} \times \mathrm{C1L2}}. \tag{87} \]
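
A short sketch of equations (85)-(87) is given below. The function name is hypothetical, and the guard against C1L2 = 0 is an added assumption that is not part of the equations; it simply falls back to the "no reuse" value of 1.0 when no L2 misses remain to be charged to the STRIDE-1 pattern.

```python
def stride1_only_parameters(adjmem, c1l2, dperl2, pergn):
    """Return (PERSA, PERG1, USEG1) when no working set was detected."""
    persa = 0.0                                  # eq. 85
    perg1 = 1.0 - pergn                          # eq. 86
    if c1l2 > 0.0:
        useg1 = adjmem / (dperl2 * c1l2)         # eq. 87
    else:
        useg1 = 1.0  # assumption: no L2 misses left, so report no reuse
    return persa, perg1, useg1
```

Here USEG1 is the number of remaining memory operations per data element brought in by an L2 miss, so all data reuse ends up attributed to the STRIDE-1 pattern.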
The only complicated part of this is that any data reuse that occurs must now be
mapped to the STRIDE-1 access pattern, as was done in the last equation. This
concludes the discussion of the equations and logic behind this tool.


6.  Future  Work 


Work is currently under way to improve the usability of this code. Additionally,
research has been initiated to identify what characteristics of a parallel
code need to be taken into consideration when estimating the performance of a
parallel program. Unfortunately, our initial experience in this area indicates that
this is a highly complex problem that is probably too difficult to tackle in the
general case. We hope that in the future, we will be able to produce useful
simulators for some of the more commonly found cases.


7.  Results  and  Conclusions 


We have created an entirely new simulator based on Back-of-the-ENVELOPE
calculations that is capable of simulating the performance of computationally
intensive workloads in a short fixed amount of time. An associated tool that
makes  the  simulator  friendlier  to  use  has  also  been  discussed.  Experience  with 
using  ENVELOPE  has  shown  that  in  almost  all  cases,  it  can  accurately  predict  the 
performance  of  the  user's  code  to  within  a  factor  of  2  of  the  measured  value. 
Furthermore,  in  many  cases,  we  were  able  to  achieve  agreement  with 
experimental  results  to  within  ±10-15%. 
Table 1. Input parameters for an IBM SP with 375-MHz Power 3 Thin SMP nodes for the Linpack 100 x 100 benchmark.
Table 2. Input parameters for an IBM SP with 375-MHz Power 3 Thin SMP nodes for the CG NAS benchmark (class B using MPI) with
prefetching  "disabled." 
Table 3. A comparison of predicted results from ENVELOPE to measured results.


the adjustments were made, the need and appropriateness were immediately apparent, since the difference in the predicted level of performance might
vary by as much as a factor of 10.


The available memory bandwidth and memory latency for a shared memory system can be difficult to get right. If the measurements were made using
a single processor on a dedicated system, then the measured level of performance might be artificially inflated (the job can use more than its fair share
of the memory bandwidth for prefetching). Similarly, if the job can be "locked" onto a single processor of a system with nonuniform memory access
times, then the latency might be significantly less than would otherwise be measured (e.g., on the SGI O2K, this can affect performance by up to a factor
of 3).
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Table  4.  A  sample  run  of  the  program  that  uses  Perfex  data  to  suggest  the  input 
parameters  for  use  with  the  program  ENVELOPE. 


$ envelope.perfex_guide

This  program  is  designed  to  request  a  limited  amount  of  information  (some  hardware  and  some  from 
running  PERFEX  or  a  similar  tool)  and  then  to  output  a  recommended  set  of  input  for  some  of  the 
input  values  requested  by  the  program  envelope.  This  program  makes  heavy  use  of  heuristics,  so  in 
no  way  is  it  as  accurate  as  a  line-by-line  analysis  of  the  source  code.  However,  in  many  cases,  it  will  be 
good  enough.  One  point  of  caution:  The  values  for  "the  scratch  array  memory  footprint"  and  "the 
block size" are guesses. They could be larger than these guesses (up to the limits of the size of the L1
and L2 cache, respectively). It is even possible that work assigned to large global blocked arrays
represents a second working set that should be assigned to the scratch arrays or vice-versa. The
rationale  for  doing  things  the  way  they  have  been  done  is  that  it  supports  two  distinct  working  set 
sizes  within  the  constraints  of  the  ENVELOPE  program. 

We  will  start  off  by  trying  to  estimate  the  number  of  floating  point  MADD,  ADD,  and  MULTIPLY 
instructions.  This  is  an  imperfect  process.  In  particular,  it  is  sometimes  difficult  to  know  what  to  call  a 
MADD,  since  the  SGI  hardware  can  efficiently  process  independent  ADDS  and  MULTIPLIES  in  the 
same  cycle.  In  theory,  this  can  result  in  up  to  a  factor  of  2  difference  between  the  predicted  and 
measured  levels  of  performance.  The  only  solution  to  this  problem  is  to  compare  the  prediction  for  the 
system  used  to  run  PERFEX,  with  the  measured  number,  and  to  then  fine-tune  the  numbers 
accordingly. 

This  section  of  the  program  can  work  in  three  ways: 

1)  Skip  this  section  entirely. 

2)  Combine  Perfex  data  with  an  a  priori  knowledge  of  the  total  number  of  floating  point  operations  to 
estimate  things. 

NOTE:  The  a  priori  knowledge  can  be  easily  gained  by  measuring  the  number  of  floating  point 
instructions  with  MADDS  turned  off.  On  the  SGI  Origin,  this  is  done  by  compiling  with  the  -mips3 
option. 

3)  Combine  Perfex  data  with  numbers  from  the  Cray  Hardware  Performance  Monitor  to  estimate 
things. 

What  do  you  want  to  do  (enter  1, 2,  or  3)? 

2 

What  is  the  total  number  of  floating  point  operations? 

5.8937688E10 

What  is  the  number  of  Graduated  Floating  Point  instructions  (from  Perfex)? 

27917089584 

For  the  purpose  of  running  ENVELOPE,  it  is  estimated  that  there  are: 

2.9468844E+10  the number of madds
0.0000000E+00  the number multiplies, etc.
0.0000000E+00  the number of adds, etc.

NOTE:  Given  the  input,  it  is  generally  impossible  to  precisely  know  the  ratio  between  ADD  and 
MULTIPLY  instructions,  but  for  the  purpose  of  this  program,  it  doesn't  matter. 

Unless  you  know  the  memory  footprint  (e.g.,  use  the  number  from  TOP  for  RSS),  you  might  want  to 
assume  1024  MB. 

Unless you know the size of the group, assume 1.

Unless  you  know  the  density  of  the  group,  assume  1. 

Unless  the  code  does  a  lot  of  integer  operations,  other  than  for  address  calculation,  assume  0.0. 

What is the line size for the L1 cache in bytes (32 bytes on the O2K)?

32

What is the line size for the L2 cache in bytes (128 bytes on the O2K)?

128 

What  is  the  size  of  a  page  of  memory  (for  an  Origin  2000  or  Origin  3000,  use  16  kB)  in  kB? 

16 

What  is  the  size  of  the  data  item  in  bytes  (usually  8)? 

8 

How  many  LOADS  graduated  (from  Perfex)? 

84240901392 

How  many  STORES  graduated  (from  Perfex)? 

1128365872 

What is the L1 Miss rate (from Perfex)?

36720735 _ 3552 

What  is  the  L2  Miss  rate  (from  Perfex)? 

2598239936 

What  is  the  TLB  Miss  rate  (from  Perfex)? 

22215984 

Additional  values  to  use  as  input  for  ENVELOPE  are  as  follows, 

128.0000  cache line size in bytes
16.00000  page size in kB
627.6436  the amount of data being loaded into the proc. in GB
8.406981  the amount of data being stored by the proc. in GB
1.000000  Used RLSVDEDICATED
0.0000000E+00  Percentage LSVDEDICATED
1.000000  Used RSSVDEDICATED
0.0000000E+00  Percentage SSVDEDICATED
0.0000000E+00  Percentage SVMADDS
0.0000000E+00  Percentage SVMULTIPLIES
0.0000000E+00  Percentage SVIOPS
0.0000000E+00  Percentage SSVDEDICATED
0.0000000E+00  Percentage SSVDEDICATED
1.000000  Used RLSVGENERAL
0.0000000E+00  Percentage LSVGENERAL
1.000000  Used RSSVGENERAL
0.0000000E+00  Percentage SSVGENERAL
1.000000  Used RLSADEDICATED
0.0000000E+00  Percentage LSADEDICATED
1.000000  Used RSSADEDICATED
0.0000000E+00  Percentage SADEDICATED
0.0000000E+00  Percentage SAMADDS
0.0000000E+00  Percentage SAMULTIPLIES
0.0000000E+00  Percentage SAADDS
0.0000000E+00  Percentage SAIOPS
1.000000  Used RLSAGENERAL
0.0000000E+00  Percentage LSAGENERAL
1.000000  Used RSSAGENERAL
0.0000000E+00  Percentage SSAGENERAL
1.1700000E-02  the scratch array memory footprint
1.000000  Used CLGB
0.0000000E+00  Percentage LGB
1.000000  Used CSGB
0.0000000E+00  Percentage SGB
0.0000000E+00  Percentage GBMADDS
0.0000000E+00  Percentage GBMULTIPLIES
0.0000000E+00  Percentage GBADDS
0.0000000E+00  Percentage GBIOPS
-1.000000  the block size

1.000000  Used CLGN
2.2634468E-03  Percentage LGN
1.000000  Used CSGN
2.2634468E-03  Percentage SGN
2.2634468E-03  Percentage GNMADDS
2.2634468E-03  Percentage GNMULTIPLIES
2.2634468E-03  Percentage GNADDS
2.2634468E-03  Percentage GNIOPS
2.055018  Used CLG1
99.99773  Percentage LG1
2.055018  Used CSG1
99.99773  Percentage SG1
99.99773  Percentage G1MADDS
99.99773  Percentage G1MULTIPLIES
99.99773  Percentage G1ADDS
99.99773  Percentage G1IOPS
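
Some of the arithmetic behind the sample run in Table 4 can be reproduced with a short sketch. The report does not list the heuristics that the guide program uses, so the rule below for splitting floating point work into MADDs and single operations is an assumption that happens to be consistent with the sample output; the function names are hypothetical. The conversion of LOAD and STORE counts to GB assumes 8-byte data items and 1 GB = 2**30 bytes, which matches the values printed in the sample run.

```python
GB = 2 ** 30  # assumption: the sample output is matched with 1 GB = 2**30 bytes


def estimate_fp_mix(total_flops, graduated_fp_instructions):
    """Estimate (madds, single_ops) from operation and instruction counts.

    Assumed rule (not taken from the report): a MADD performs two operations
    but graduates as one instruction, so the excess of operations over
    instructions estimates the number of MADDs. The estimate is clamped to
    the range [0, total_flops / 2].
    """
    madds = total_flops - graduated_fp_instructions
    madds = min(max(madds, 0.0), total_flops / 2.0)
    single_ops = total_flops - 2.0 * madds
    return madds, single_ops


def data_volume_gb(access_count, item_bytes=8):
    """Convert a count of graduated LOADs or STOREs into GB of data moved."""
    return access_count * item_bytes / GB


if __name__ == "__main__":
    # Inputs copied from the sample run in Table 4.
    madds, single_ops = estimate_fp_mix(5.8937688e10, 27917089584)
    print(f"{madds:.7E}  the number of madds")
    print(f"{single_ops:.7E}  the number of adds and multiplies")
    print(f"{data_volume_gb(84240901392):.4f}  data loaded in GB")
    print(f"{data_volume_gb(1128365872):.6f}  data stored in GB")
```

For the sample inputs, the difference between operations and instructions exceeds half of the operation count, so the clamp applies and all of the work is attributed to MADDs, in agreement with the 2.9468844E+10 MADDs and zero ADDs and MULTIPLIES shown in the sample run.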
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Glossary 


ARL 

U.S.  Army  Research  Laboratory 

BLAS 

Basic  Linear  Algebra  Subprograms 

CFD 

Computational  Fluid  Dynamics 

CISC 

Complex Instruction Set Computer - an approach to
processor design that assumes that the best way to get
good performance out of a system is to provide
instructions that are designed to implement key
constructs (e.g., loops) from high-level languages

CSM 

Computational  Structural  Mechanics 

CPU 

Central  Processing  Unit 

GFLOPS 

Billion  Floating  Point  Operations  per  Second 

High-Level  Languages 

Computer  languages  that  are  designed  to  be  relatively 
easy  for  the  programmer  to  read  and  write.  Examples  of 
this  type  of  language  are  FORTRAN,  COBOL,  C,  etc. 

kB 

Thousand  Bytes 

Low-Level  Languages 

Computer  languages  that  are  designed  to  reflect  the 
actual  instruction  set  of  a  particular  computer.  In 
general,  the  lowest  level  language  is  known  as  Machine 
Code.  Just  slightly  above  Machine  Code  is  a  family  of 
languages  collectively  known  as  Assembly  Code. 

MB 

Million  Bytes 

MFLOPS 

Million  Floating  Point  Operations  per  Second 

MHz 

Million  Hertz  (cycles/second) 

MPI 

Message-Passing  Interface 

MSRC 

Major  Shared  Resource  Center 

NAS 

Numerical Aerospace Simulation - a division of the
Information  Sciences  and  Technology  Directorate  at 
NASA  Ames  Research  Center,  Moffett  Field,  CA 

PAPI 

Performance  Application  Programming  Interface 

RISC

Reduced Instruction Set Computer - an approach to
processor design that argues that the best way to get
good performance out of a system is to eliminate the
Micro Code that CISC systems use to implement most of
their instructions. Instead, all of the instructions will be
directly implemented in hardware. This places obvious
limits on the complexity of the instruction set, which is
why the complexity had to be reduced.

SMP

Symmetric Multiprocessor

SPEC

Standard Performance Evaluation Corporation - a
company formed to create industry standard
benchmarks (mostly for desktop systems)